Assessing the quality of crowdsourced data in CollabMap from their provenance
In this notebook, we compare the classification accuracy achieved on the unbalanced (original) CollabMap datasets with that achieved on the balanced versions of the same datasets.
The CollabMap dataset is provided in the collabmap/depgraphs.csv file; each row corresponds to a building, route, or route set created in the application:
id
: the identifier of the data entity (i.e. building/route/route set).
trust_value
: the beta trust value calculated from the votes for the data entity.
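The trust values come precomputed in the CSV file. Purely as an illustration, assuming they follow the usual expected value of a Beta distribution over the positive and negative vote counts (an assumption; the vote counts themselves are not included in this file), such a value could be derived as follows:
# Illustrative sketch only: a beta trust value derived from vote counts,
# assuming the expected value of a Beta(positive + 1, negative + 1) distribution.
# The vote counts themselves are not part of collabmap/depgraphs.csv.
def beta_trust(positive_votes, negative_votes):
    return (positive_votes + 1) / (positive_votes + negative_votes + 2)

beta_trust(8, 1)  # e.g. 8 approvals and 1 disapproval -> 0.8181...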
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("collabmap/depgraphs.csv", index_col='id')
df.head()
Out[2]:
In [3]:
df.describe()
Out[3]:
In [4]:
trust_threshold = 0.75
df['label'] = df.apply(lambda row: 'Trusted' if row.trust_value >= trust_threshold else 'Uncertain', axis=1)
df.head() # The new label column is the last column below
Out[4]:
Having used the trust value to label all the data entities, we remove the trust_value column from the data frame.
In [5]:
# We will not use trust value from now on
df.drop('trust_value', axis=1, inplace=True)
df.shape # the dataframe now has 23 columns (22 metrics + label)
Out[5]:
In [6]:
# Split the data frame into three datasets by the id prefix of each entity
df_buildings = df.filter(like="Building", axis=0)
df_routes = df.filter(regex=r"^Route\d", axis=0)  # excludes RouteSet entities
df_routesets = df.filter(like="RouteSet", axis=0)
df_buildings.shape, df_routes.shape, df_routesets.shape  # the number of data points in each dataset
Out[6]:
We now run the cross validation tests on the three unbalanced datasets (df_buildings, df_routes, and df_routesets) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for the detailed description of the cross validation code.
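The test_classification function imported below comes from analytics.py and is documented in Cross Validation Code.ipynb. Purely as an illustration for readers without that notebook, a comparable routine might look like the following sketch; the decision-tree classifier, the 10-fold setup, and the generic_cols/provenance_cols parameters are assumptions, not the actual implementation.
# Illustrative sketch of a comparable cross-validation routine; NOT the actual
# implementation in analytics.py. The classifier, fold count, and the
# generic_cols/provenance_cols parameters are assumptions.
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def sketch_test_classification(df, generic_cols, provenance_cols, n_splits=10):
    feature_sets = {
        'generic': list(generic_cols),
        'provenance': list(provenance_cols),
        'combined': list(generic_cols) + list(provenance_cols),
    }
    accuracy_rows, importance_rows = [], []
    for metrics_name, cols in feature_sets.items():
        X, y = df[cols].values, df['label'].values
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in cv.split(X, y):
            clf = DecisionTreeClassifier(random_state=0)
            clf.fit(X[train_idx], y[train_idx])
            acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
            accuracy_rows.append({'Metrics': metrics_name, 'Accuracy': acc})
            importance_rows.extend(
                {'Metrics': metrics_name, 'Feature': c, 'Importance': i}
                for c, i in zip(cols, clf.feature_importances_)
            )
    # Two DataFrames: per-fold accuracy scores and per-fold feature importances
    return pd.DataFrame(accuracy_rows), pd.DataFrame(importance_rows)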
In [7]:
from analytics import test_classification
We test the classification of buildings, collecting the individual accuracy scores in results_unb and the importance of every feature in each test in importances_unb (both are pandas DataFrames). These two tables will also be used to collect the results from testing the classification of routes and route sets later.
In [8]:
# Cross validation test on building classification
res, imps = test_classification(df_buildings)
# adding the Data Type column
res['Data Type'] = 'Building'
imps['Data Type'] = 'Building'
# storing the results and importance of features
results_unb = res
importances_unb = imps
In [9]:
# Cross validation test on route classification
res, imps = test_classification(df_routes)
# adding the Data Type column
res['Data Type'] = 'Route'
imps['Data Type'] = 'Route'
# storing the results and importance of features
results_unb = pd.concat([results_unb, res], ignore_index=True)
importances_unb = pd.concat([importances_unb, imps], ignore_index=True)
In [10]:
# Cross validation test on route set classification
res, imps = test_classification(df_routesets)
# adding the Data Type column
res['Data Type'] = 'Route Set'
imps['Data Type'] = 'Route Set'
# storing the results and importance of features
results_unb = pd.concat([results_unb, res], ignore_index=True)
importances_unb = pd.concat([importances_unb, imps], ignore_index=True)
This section explores the class balance of each of the three datasets and balances them using the SMOTE oversampling method.
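The balance_smote function used below is provided by analytics.py. Purely as an illustration, an equivalent balancing step might look like the following sketch, assuming the imbalanced-learn package; it is not the actual implementation.
# Illustrative sketch of an equivalent SMOTE balancing step; NOT the actual
# analytics.balance_smote implementation. Assumes the imbalanced-learn package.
import pandas as pd
from imblearn.over_sampling import SMOTE

def sketch_balance_smote(df):
    X = df.drop('label', axis=1)
    y = df['label']
    # Oversample the minority class with synthetic examples until both
    # classes have the same number of data points
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    balanced = pd.DataFrame(X_res, columns=X.columns)
    balanced['label'] = pd.Series(y_res).values
    return balanced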
In [11]:
from analytics import balance_smote
In [12]:
df_buildings.label.value_counts()
Out[12]:
Balancing the building dataset:
In [13]:
df_buildings = balance_smote(df_buildings)
In [14]:
df_routes.label.value_counts()
Out[14]:
Balancing the route dataset:
In [15]:
df_routes = balance_smote(df_routes)
In [16]:
df_routesets.label.value_counts()
Out[16]:
Balancing the route set dataset:
In [17]:
df_routesets = balance_smote(df_routesets)
We now repeat the classification tests on the balanced datasets, collecting the individual accuracy scores in results_bal and the feature importances in importances_bal (both pandas DataFrames), in the same way as for the unbalanced datasets above.
In [18]:
# Cross validation test on building classification
res, imps = test_classification(df_buildings)
# adding the Data Type column
res['Data Type'] = 'Building'
imps['Data Type'] = 'Building'
# storing the results and importance of features
results_bal = res
importances_bal = imps
In [19]:
# Cross validation test on route classification
res, imps = test_classification(df_routes)
# adding the Data Type column
res['Data Type'] = 'Route'
imps['Data Type'] = 'Route'
# storing the results and importance of features
results_bal = pd.concat([results_bal, res], ignore_index=True)
importances_bal = pd.concat([importances_bal, imps], ignore_index=True)
In [20]:
# Cross validation test on route set classification
res, imps = test_classification(df_routesets)
# adding the Data Type column
res['Data Type'] = 'Route Set'
imps['Data Type'] = 'Route Set'
# storing the results and importance of features
results_bal = pd.concat([results_bal, res], ignore_index=True)
importances_bal = pd.concat([importances_bal, imps], ignore_index=True)
In [21]:
# Merging the two result sets
results_unb['Balanced'] = False
results_bal['Balanced'] = True
results = pd.concat([results_unb, results_bal], ignore_index=True)
In [22]:
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("talk")
Converting the accuracy scores from [0, 1] to percentages, i.e. [0, 100]:
In [23]:
results.Accuracy = results.Accuracy * 100
In [24]:
pal = sns.light_palette("seagreen", n_colors=3, reverse=True)
g = sns.catplot(data=results, x='Data Type', y='Accuracy', hue='Metrics', col='Balanced',
                kind='bar', palette=pal, aspect=1.2, errwidth=1, capsize=0.04)
g.set(ylim=(88, 98))
Out[24]:
For this application, training and testing classifiers on the unbalanced data yields similar levels of accuracy to the balanced data, except for buildings. It seems that the heavily skewed buildings dataset (87% Trusted vs 13% Uncertain) inflates the apparent performance of the building classifiers, hence the difference: a classifier that always predicts Trusted is already 87% accurate on the unbalanced data, compared to the 50% baseline accuracy of random predictions on a balanced dataset.
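As a quick sanity check, that majority-class baseline can be recomputed from the original (pre-SMOTE) labels still held in df:
# Majority-class baseline on the unbalanced building labels: the accuracy
# obtained by always predicting the most common label
building_labels = df.filter(like="Building", axis=0).label.value_counts()
building_labels.max() / building_labels.sum()  # around 0.87, per the 87%/13% split above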